PIMA Indian Diabetes

Nowadays, with the improvement of the technologies there are an amount of data that allows us to understand if there is any problem or not in the healthy of a person.

Here in this project we explain and we try to predict whether a patient should be diagnosed with Diabetic or not. In this project we will apply a Fully Bayesian Logistic Regression model aimed at predicting the onset of diabetes mellitus in Pima Indians given diagnostic measurements.
Diabetes is a metabolic disorders characterized by a high blood sugar level (hyperglycemia) over a prolonged period of time. It is mainly due to the reduction of insulin production or lack of capability by the cells to properly respond to the insulin produced.

Kaggle, is the main platform where I found this interesting dataset where applying our main bayesian inference as the main scope of this project.

This dataset is composed by these features: The dataset used is Pima Indians Diabetes Dataset, publicly available on Kaggle : https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database, containing medical information of female patients belonging to Pima Indian heritage of age \(\geq\) 21.

Analyzing these features, we captured some interesting information for each feature within the dataset:

For each patient, the binary target variable Outcome indicates whether or not she is affected by Diabetes.

Exploratory Data Analysis

Missing data

The main features detected have these summaries:

Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
Min. 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0780 21.0000
1st Qu. 1.0000 99.0000 62.0000 0.0000 0.0000 27.3000 0.2438 24.0000
Median 3.0000 117.0000 72.0000 23.0000 30.5000 32.0000 0.3725 29.0000
Mean 3.8451 120.8945 69.1055 20.5365 79.7995 31.9926 0.4719 33.2409
3rd Qu. 6.0000 140.2500 80.0000 32.0000 127.2500 36.6000 0.6262 41.0000
Max. 17.0000 199.0000 122.0000 99.0000 846.0000 67.1000 2.4200 81.0000
Missing Values 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Zeros 111.0000 5.0000 35.0000 227.0000 374.0000 11.0000 0.0000 0.0000

  

summary(data)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

let’s how variables are distributed:

# How variables are distributed
library(reshape2)
library(ggplot2)
gg <- melt(data)
ggplot(gg, aes(x=value, fill=variable)) +
  geom_histogram(binwidth=5)+
  facet_wrap(~variable)

There are no missing data, but several meaningless zero values, that we can consider as NaN (e.g., BloodPressure = 0). So, we replaced the zeros with the per-class median in the following fields:

  • Glucose
  • BloodPressure
  • BMI We build a model based on all variables and as we see here:
## 
## Call:
## lm(formula = Outcome ~ ., data = dataBP)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.01348 -0.29513 -0.09541  0.32112  1.24160 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -0.8538943  0.0854850  -9.989  < 2e-16 ***
## Pregnancies               0.0205919  0.0051300   4.014 6.56e-05 ***
## Glucose                   0.0059203  0.0005151  11.493  < 2e-16 ***
## BloodPressure            -0.0023319  0.0008116  -2.873  0.00418 ** 
## SkinThickness             0.0001545  0.0011122   0.139  0.88954    
## Insulin                  -0.0001805  0.0001498  -1.205  0.22857    
## BMI                       0.0132440  0.0020878   6.344 3.85e-10 ***
## DiabetesPedigreeFunction  0.1472374  0.0450539   3.268  0.00113 ** 
## Age                       0.0026214  0.0015486   1.693  0.09092 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4002 on 759 degrees of freedom
## Multiple R-squared:  0.3033, Adjusted R-squared:  0.2959 
## F-statistic: 41.29 on 8 and 759 DF,  p-value: < 2.2e-16

Predictors with p-values greater than 0.05 might be considered not significant and could potentially be removed from the model.In this case, SkinThickness and Insulin have a p-value greater than 0.05, suggesting that they might not be statistically significant predictors of the outcome. So, we removed them.

Feature Distribution

\[~\] Here it is represented the Box Plot for all the features

The Data Visualization

\[~\] In the following, representing the distribution of the different features. Please notice that the values reported refer to intra-class percentages, i.e., the counts are normalized by dividing for the carnality of the class they refer to. By doing so, we are able to compare the different categories regardless from the class unbalance of the data set.

Age

\[~\] Check what is the impact of age over the Outcome

ggplot(data,aes(x=Age,fill=factor(Outcome)))+geom_density(alpha=0.4)+scale_fill_manual(values=c("blue", "red"))+labs(title="Distribution of Age")

In this dataset, diabetes seems more common across people above age of 35; this is possibly because that Type I diabetes can arise at any age, while Type II diabetes is more common across people of age \(\geq 40\).

At young age Diabetes pedigree function was high, as age goes on its get reduced.

ggplot(data,aes(x=cut(Age,breaks=5),y=DiabetesPedigreeFunction,fill=cut(Age,breaks=5)))+geom_boxplot()+scale_fill_brewer(palette="RdBu")

Glucose

\[~\]

Not surprisingly, diabetic people have an higher glycemic index, being this is one of the hillness’ symptoms.

Check the relationship between Glucose and BP

# Imports
dataBP <- read.csv("C:/Users/user/OneDrive/Документы/diabetes.csv", header = TRUE)
ggplot(dataBP,aes(x=Glucose,y=BloodPressure,size=Age,color=Outcome))+geom_jitter(alpha=0.6)+scale_color_gradient(low = 'red', high = 'blue')+labs(title="BP and Glucose level against Age")

Body Mass Index(BMI)

\[~\]

Diabetes is more common among obese people; indeed obesity is one of the risk factor for developing Type II diabetes.

col =  list('Diabetic' = 'blue', 'Healthy'= 'red')

ggplot(data,aes(x=BMI,fill=factor(Outcome)))+stat_density(alpha=0.6)+scale_fill_manual(values=c("blue", "red"))+labs(title="Distribution of BMI")+theme(legend.position="bottom",title=element_text("Outcome"))

###Pregnancy \[~\] Number of Pregnancies has an impact over diabetes outcome

#Number of Pregnancies has an impact over diabetes outcome
ggplot(data,aes(x=Pregnancies,fill=factor(Outcome)))+geom_bar(position="Dodge")+scale_fill_manual(values=c("blue","red"))+scale_x_continuous(limits=c(0,16))+labs(title="Pregnancies Vs Outcome")

Relationship between Glucose and Diabetes Pedigree Function vs Pregnancies

#Relationship between Glucose and Diabetes Pedigree Function vs Pregnancies
ggplot(dataBP,aes(x=Glucose,y=DiabetesPedigreeFunction,color=Pregnancies))+geom_point()+scale_color_gradient(low = 'red', high = 'blue')

Blood Pressure

\[~\]

Diabetic people seem to suffer more from hypertension than non-diabetic patients. This may be due to the fact that some forms of diabetes (Type II) are associated with high weight, which on turn may cause high blood pressure.

Diabetes Pedigree Function

\[~\]

Finding the correlation between the attributes

# Finding the correlation between the attributs
library(ggcorrplot)

corr<-round(cor(dataBP),1)
ggcorrplot(corr, hc.order = TRUE, 
           type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           method="circle", 
           colors = c("red", "white", "blue"), 
           title="Correlogram of Diabetes data", 
           ggtheme=theme_bw)

The features in our dataset are weakly correlated, hence in the regression, there won’t be Multicollinearity problems.Age and number of Pregancies are weakly Diabeticly correlated (\(\rho \approx 0.5\) ). This is not surprising since elder women are more likely to have had an higher number of Pregnancies during their life.

ggplot(dataBP,aes(x=Insulin,y=Glucose))+geom_point(aes(color=Outcome))+geom_smooth()+scale_color_gradient(low = 'red', high = 'blue')

Class distribution

\[~\]

In order to overcome the severe class imbalance of our data, we applied SMOTE: a data augmentation strategy consisting in oversampling the minority class (Diabetic in our case).

Eventually we obtained a dataset with the same statistics as above, but more balanced!

Preliminary Brief Definitions

\[~\] - Likelihood \(\pi(y_{obs}|\theta)\): measures the goodness of fit of a statistical model to a sample of data for giving values of the unknown parameters. It is formed from the joint probability distribution of the sample, but viewed and used as a function of the parameters only, thus treating the random variables as fixed at the observed values.

The Model

\[~\] We will use Bayesian Logistic Regression to model the relationship between the binary target variable \(Y \in \{0,1\}\) (diabetic = 1, healthy = 0) with the input features \(x_j\ \ ,\ \ \ j=1,...,p\):

feature name variable name
Pregnancies x1
Glucose x2
BloodPressure x3
BMI x4
DiabetesPedigreeFunction x5
Age x6
Outcome Y

Moreover, in the following we will consider the canonical notation: \(x_0=1\) in order to model the intercept. Differently from Bayesian linear regression, there is no variance term to be estimated, and only the regression parameters (\(\beta_j\)) will be estimated.

For each patient, the target variable can be modeled as a Bernoulli: \[ Y|\boldsymbol{x} \sim Ber(\theta_{\boldsymbol{x}})\] Hence: \[ \Pr(Y = 1|\boldsymbol{x}) = \mathbb{E}[Y = 1|\boldsymbol{x}] = \theta_\boldsymbol{x} \] In logistic regression, the Link function that links \(\mathbb{E}[Y = 1|\boldsymbol{x}]\) with the linear predictor \(\boldsymbol{\beta \cdot x} = \sum_{j} \beta_j \cdot x_j\) is the Logit Function:

\[ \theta_{\boldsymbol{x}} = \text{ilogit}\left(\boldsymbol{\beta \cdot x} \right) \iff \boldsymbol{\beta \cdot x} = \text{logit} \left(\theta_{x}\right) \]

where: \[ \text{ilogit}\left(z\right) = \frac{\exp(z)}{1+\exp(z)} \] is the sigmoid function, forcing \(\boldsymbol{\beta \cdot x_i}\) in the \([0,1]\) range;

and \[\text{logit}(\pi) = \log\left(\frac{\pi}{1-\pi}\right)\] Is the logit function (inverse of the sigmoid) computing the logarithm of the odds.

Likelihood

The overall dataset is randomly shuffled and sliced in Train and Test sets according to a 80/20 split.

  • The Training examples represent the observations in our Bayesian Analysis.
  • The Test set will be used to measure the performances of our model and compare them with the ones achieved by other classification ML techniques.

Let’s say we have \(n\) observations; we can define their likelihood function as follows: \[ L_{\boldsymbol{y}}(\boldsymbol{\beta}, \boldsymbol{X}) = f(\boldsymbol{y}|\boldsymbol{\beta},\boldsymbol{X}) = \prod_{i= 1}^n f(y^{(i)}|\boldsymbol{\beta,x^{(i)}}) = \prod_{i= 1}^n \text{ilogit}( \boldsymbol{\beta \cdot x^{(i)}}) ^{y^{(i)}}\left(1-\text{ilogit}( \boldsymbol{\beta \cdot x^{(i)}})\right)^{1-y^{(i)}} \]

Joint prior of the parameters

\[~\] The joint prior distribution of the parameters on \(\mathbf{B}^p = \{(-\infty, +\infty)\}^{p}\) can be computed as the product of their marginal distributions:

\[ \pi(\boldsymbol{\beta}) = \prod_{j= 1}^{p} \pi(\beta_j) = \pi(\beta_0)\cdot \pi(\beta_1)\cdot ...\cdot \pi(\beta_p) \]

In our case we will consider two kind of priors, expressing our belief that each \(\beta_j\) represents the true population characteristics (under our modellistic assumptions), prior to the observation of any data :

  • Normal distribution:

\[ \boldsymbol{\beta} \sim N_p(\boldsymbol{\mu}, \mathbb{I}_p\sigma^2) \iff\beta_j \overset{iid}{\sim} N\left(\mu_j, \sigma_j^2= 10^4\right) \ \ \ j = 1,..., p \]

  • Laplace distribution:

\[ \beta_j \overset{iid}{\sim} \text{Laplace}\left(\mu_j,b_j = \frac{10^2}{\sqrt2}\right) \ \ \ j = 1,..., p \] For what concernes the hyperparameters, we consider both Normal and Laplace distribution centered in 0 (\(\mu_j = 0 \ , \ \ j = 1,..., p\)) Since we have weak prior knowledge, a diffuse prior distribution is used, spreading more or less evenly the probability distribution over \(\beta_j\), which is modeled in terms of assigning an high variance to the prior:

  • Gaussian: \(\sigma^2_j = 10^4 \ , \ \ j = 1,..., p\) .
  • Laplace: \(b_j = \frac{10^2}{2} \iff \mathbb{V}\text{ar}(\beta_j)= 10^4 \ , \ \ j = 1,..., p\) .

We will then compare the Deviance Information Criterion of the two models to select the most fitting one.

Jags Models

\[~\]

Normal

\[~\] model { for (i in 1:N){ Y[i] ~ dbern(p[i]) p[i] <- ilogit(beta0+beta1x1[i] + beta2x2[i] + beta3x3[i] + beta4x4[i] + beta5x5[i] + beta6x6[i])

  }
  
  # Defining the prior beta parameters
  beta0 ~ ddexp(0, 1.0E-4)
  beta1 ~ ddexp(0, 1.0E-4)
  beta2 ~ ddexp(0, 1.0E-4)
  beta3 ~ ddexp(0, 1.0E-4)
  beta4 ~ ddexp(0, 1.0E-4)
  beta5 ~ ddexp(0, 1.0E-4)
  beta6 ~ ddexp(0, 1.0E-4)

}
mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff
beta0 -8.6472 0.7593 -10.1110 -9.1532 -8.6436 -8.1331 -7.2100 1.0005 3000
beta1 0.1168 0.0316 0.0581 0.0951 0.1164 0.1389 0.1787 1.0005 3000
beta2 0.0354 0.0034 0.0289 0.0330 0.0354 0.0376 0.0423 1.0008 3000
beta3 -0.0062 0.0081 -0.0221 -0.0117 -0.0063 -0.0007 0.0101 1.0008 3000
beta4 0.1027 0.0150 0.0738 0.0924 0.1024 0.1128 0.1325 1.0007 3000
beta5 0.8973 0.2855 0.3435 0.7077 0.8944 1.0909 1.4522 1.0019 1400
beta6 0.0130 0.0094 -0.0048 0.0067 0.0128 0.0193 0.0314 1.0007 3000
deviance 849.4309 3.9556 844.1442 846.6033 848.7129 851.5262 858.6929 1.0007 3000
DIC pD
857.2577 7.8269

We know that \(\beta_i\) measures the marginal impact of the \(X_i\) on the odds in favor of \(Y\). Fixed this concept, we can comment the results obtained.

  • Mean is the point estimate for \(\beta_i\)
  • Sd is the standard deviation
  • Rhat is the potential scale reduction and it is a measure of the convergence of the chain; it is a comparison between chain variance and whithin chain variance; if they are similar they come from the same distribution. As we can see in our case, this value is very close to 1 for each predicted value
  • n.eff is the effective sample size that can be considered as the number of independent Monte Carlo samples necessary to same precision of the MCMC samples. The greater is this value, the lower the autocorrelation between the MCMC steps is, and the better is the final approximation
  • DIC is the Deviance Information Criteria , which balances the model fit and complexity. It is calculated as the sum of the deviance and the effective number of parameters (pD). Lower DIC values indicate a better trade-off between fit and complexity.
  • deviance is a measure of goodness of the model fit at different steps of the chain. We want it to be stationary to be able to use it for DIC and understand the goodness of the model

Laplace

\[~\] # MODEL SPECIFICATION

model 
{
  for (i in 1:N){
    Y[i] ~ dbern(p[i])
    p[i] <- ilogit(beta0+beta1*x1[i] + beta2*x2[i] + beta3*x3[i] + beta4*x4[i] + beta5*x5[i] + beta6*x6[i])
    
  }
  
  # Defining the prior beta parameters
  beta0 ~ ddexp(0, 1.0E-4)
  beta1 ~ ddexp(0, 1.0E-4)
  beta2 ~ ddexp(0, 1.0E-4)
  beta3 ~ ddexp(0, 1.0E-4)
  beta4 ~ ddexp(0, 1.0E-4)
  beta5 ~ ddexp(0, 1.0E-4)
  beta6 ~ ddexp(0, 1.0E-4)

}
mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff
beta0 -8.6299 0.7467 -10.0184 -9.1809 -8.6562 -8.0994 -7.2196 1.0119 960
beta1 0.1165 0.0317 0.0552 0.0957 0.1162 0.1370 0.1792 1.0012 3000
beta2 0.0350 0.0033 0.0288 0.0327 0.0350 0.0371 0.0416 1.0049 470
beta3 -0.0061 0.0081 -0.0226 -0.0113 -0.0061 -0.0004 0.0092 1.0058 480
beta4 0.1032 0.0152 0.0732 0.0931 0.1034 0.1133 0.1325 1.0111 590
beta5 0.8920 0.2866 0.3391 0.7014 0.8790 1.0800 1.4671 1.0013 2600
beta6 0.0133 0.0093 -0.0044 0.0070 0.0132 0.0192 0.0320 1.0031 3000
deviance 849.3543 3.7506 844.0283 846.6301 848.7282 851.3682 858.7077 1.0055 400
DIC pD
856.3575 7.0032

The 95%-CI for \(\beta_3\) in both models contains the zero, hence the parameter does not matter in our analysis in the relation between the covariates and the outputs. Let’s have a look to the model if we remove the variable \(x_3\) = Blood Pressure:

Normal Removing Null features

\[~\]

mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff
beta0 -8.8640 0.6967 -10.2138 -9.3367 -8.8709 -8.3993 -7.4989 1.0008 3000
beta1 0.1158 0.0314 0.0548 0.0951 0.1150 0.1367 0.1780 1.0008 3000
beta2 0.0348 0.0034 0.0283 0.0325 0.0348 0.0371 0.0417 1.0006 3000
beta4 0.0995 0.0142 0.0718 0.0898 0.0999 0.1092 0.1275 1.0019 1400
beta5 0.9061 0.2868 0.3602 0.7104 0.9000 1.0992 1.4678 1.0023 1100
beta6 0.0113 0.0090 -0.0060 0.0051 0.0112 0.0175 0.0289 1.0011 3000
deviance 848.8892 3.4988 844.1271 846.4120 848.1873 850.6570 856.9820 1.0036 830
DIC pD
854.9993 6.11

Laplace Removing Null features

\[~\]

mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff
beta0 -8.8867 0.7055 -10.3427 -9.3491 -8.8694 -8.4005 -7.5533 1.0098 220
beta1 0.1151 0.0321 0.0497 0.0936 0.1151 0.1369 0.1783 1.0023 1100
beta2 0.0348 0.0034 0.0280 0.0326 0.0349 0.0371 0.0414 1.0042 590
beta4 0.0998 0.0153 0.0700 0.0894 0.0998 0.1104 0.1299 1.0107 220
beta5 0.9119 0.2882 0.3601 0.7118 0.9126 1.0993 1.4859 1.0017 3000
beta6 0.0116 0.0093 -0.0061 0.0054 0.0114 0.0179 0.0297 1.0058 380
deviance 849.1369 3.6672 844.1308 846.4201 848.5044 851.0922 858.1590 1.0026 2000
DIC pD
855.8589 6.722

The criterion used for selecting the best model is the Deviance Information Criterion (DIC): a comparative index used for choosing among competing models. It is a measure of goodness of fit of the model (average deviance), discounted by a penalization term,representing the number of parameters (pD):

\[ DIC = p_D+ \hat{D}_{\text{avg}}(\theta) \] where:

\[\hat{D}_{\text{avg}}(\theta) \approx \frac{1}{M} \sum_{i=1}^M -2 \log\left(f\left(y|\theta^{(j)}\right) \right) \]

The model with the smallest DIC assumes Normal Priors and excludes the variable x3; hence, the following analysis will be referred to such setup.

Normal Model

\[~\] - Traceplot: The traceplots allow to assess the convergence of the markov chain by showing the time series of the chain. All chains seem to be exploring the same region of parameter values, which is a good sign. The expected outcome is to observe a stationary behavior after a number of transitions, which translates into regular oscillation within a certain range due to the chain reaching the target distribution.

  • Density plot: Depicts the simulated sampling distribution of the parameters, for each chain. The yellow vertical line represents the empirical mean, while the horizontal one the 95% quantile based credible interval based on the joint results of all of the three chains.

  • Autocorrelation Function Plot : The Auto Correlation Function(ACF) plots visualize how much the correlation between the simulated values holds in the previous states.

    • On the \(x\)-axis we have the Lag \(h\)
    • On the \(y\)-axis it’s reported \(\rho(h)\), estimate of the correlation among coordinates in the stochastic process which is thought to be stationary: \(\mathbb{C}\text{orr}(\beta_t,\beta_{(t+h)} )\)

If the process becomes stationary, we expect the correlation to vanish as we increase the lag \(h\). In the limit, if \(\rho=0\) the simulations are iid (which of course is not our case).

Cummulative Means

\[~\] One of the main goals of performing MCMC is to approximate quantities which are functionals of the target distribution \(\pi(\cdot)\), such as the empirical mean: \[ I = \mathbb{E}_{\pi}[\beta_{j}] \approx \hat{I_t}= \sum_{i=1}^t\frac{\beta_{i}}{t} \]

Where we assume the sample \(\beta_{j}^{(1)}, ..., \beta_{j}^{(t)}\) to be simulated from suitable Markov Chain: invariant w.r.t. \(\pi(\cdot)\). Such condition is satisfied either (i) by considering as starting distribution the stationary distribution or (ii) - under the regularity conditions make the ergodic behavior (SLL) possible - by taking only the samples following the burn-in time \(T_0\), a suitable time such that \(\beta_{j}^{(T_0-1)} \approx \pi(\cdot)\).

By looking the plots we can see that the line that quickly approaches the overall mean. The point in which this happens is precisely the burn-in time.As we can see above each parameter achieves in all of the three chains generated the same end point, so means that with different initial points in these three chains, we are strictly going to have the same estimated mean parameter.

Posterior uncertainty

\[~\] As criterion to establish the parameter with the larges the posterior uncertainty, we can look at the width of the quantile-based credible intervals (95%):

Quantile Based
LB UB width
beta0 -10.3427 -7.5533 2.7894
beta1 0.0497 0.1783 0.1285
beta2 0.0280 0.0414 0.0135
beta4 0.0700 0.1299 0.0599
beta5 0.3601 1.4859 1.1258
beta6 -0.0061 0.0297 0.0359
Highest Posterior Density
LB UB width
beta0 -10.3184 -7.5380 2.7804
beta1 0.0483 0.1760 0.1276
beta2 0.0280 0.0414 0.0134
beta4 0.0714 0.1309 0.0595
beta5 0.3445 1.4715 1.1270
beta6 -0.0063 0.0295 0.0358

As we can notice, in both cases the parameter associated to the widest credible interval is the intercept: \(\beta_0\).

Correlation between parameters

\[~\]

The most correlated couple of parameters is \(\beta_4\) and \(\beta_0\).

Approximation Errors

\[~\] The empirical mean is an unbiased estimator; therefore the MSE is equal to its variance. Differently from the Vanilla Monte Carlo case, in which the estimate are computed on iid samples, in the MCMC each realization of the sample depends on the previous one, which on turns depend on its prior and so on.. As a consequence, the variance of the empirical mean of the simulated values cannot be computed as the simulated values divided by the number of simulations, but we must take into account the covariance between each pair of simulated values. The higher the correlation, the larger will be the variance of the approximation provided by the MCMC algorithm with respect to the variance of the iid simulations proper of the MC method. This implies that the Markov Chain is more inefficient than standard iid simulation. This can be quantified by the effective sample size (ESS):

\[ t_{eff}=\frac{t}{\left(1+2\sum_{k=1}^{+\infty} \rho_k \right)} \]

Used to normalize the variance of \(h(\beta_j)\) and get the variance of \(\hat{I}_t\):

\[ \sigma^2_{\hat{I}_t} = \mathbb{V}\text{ar}[\hat{I}_t] =\frac{\mathbb{V}\text{ar}_{\pi}(h(\beta_j))}{t_{eff}} = {\left(1+2\sum_{k=1}^{+\infty} \rho_k \right)} \cdot \frac{\sigma^2}{t} \]

Another estimate of the inaccuracy of the MC samples is the Monte Carlo Standard Error (MCSE). It measures the standard deviation around the posterior mean of the samples due to the uncertainty of the MCMC algorithm.

ESS IFiid MCMCerror MCSEerror
beta0 1000 0.0005 0.0005 0.0216
beta1 1000 0.0000 0.0000 0.0010
beta2 1000 0.0000 0.0000 0.0001
beta4 1000 0.0000 0.0000 0.0004
beta5 1000 0.0001 0.0001 0.0091
beta6 1000 0.0000 0.0000 0.0003
deviance 1000 0.0122 0.0122 0.1178

Model Diagnostics

\[~\] In the following are reported some Convergence Diagnostics, aimed at verifying the presence of converge issues, which might suggest to enlarge the number of simulations or use some other types of parametrizations. We are considering different tools, such as:

  • The Gelman-Rubin Diagnostic
  • The Geweke Diagnostic
  • The Heidelberg & Welch Diagnostic
  • The Raftery & Lewis Diagnostic

Gelman-Rubin Diagnostic

\[~\] Gelman-Rubin diagnostic evaluates the convergence by analyzing the difference between multiple Markov chains. Let:

  • \(M\) : Number of chians
  • \(T\): Number of iterations for each chain.

For each parameter, we compare the between-chains and within-chain estimated variances; large differences between them indicate nonconvergence.

  • Between-chain variance: \[ B_T = \frac{1}{M} \sum_{m=1}^M (\hat{I}^{(m)} - \hat{I}_T)^2 \]
  • Within-chain variance:

\[ W_T = \frac{1}{M} \sum_{m=1}^3 \left[ \frac{1}{T} \sum_{t=1}^T \left(h\left(\beta_t^{(m)}\right) -\hat{I}_T^{(m)}\right ) \right] \]

where:

  • \(\hat{I}_T^{(m)} = \frac{1}{T} \sum_{t=1}^{T} h\left(\beta_t^{(m)}\right)\)
  • \(\hat{I_T}\) is the overall estimator using the values coming from all of the \(M\) chains

Based on these measures, we compute the Potential Scale Reduction factor: \[ \hat{R_T} = \sqrt{ \frac{T-1}{T} W_T + \frac{B_T}{n}} \]

The model is healthy if \(\hat{R}_T \approx 1\) and is below some threshold (usually \(\hat{R}_T < 1.1\)).

Gelman-Rubin diagnostic
Point Estimate Upper CI
beta0 1.000 1.001
beta1 1.001 1.005
beta2 1.000 1.002
beta4 1.001 1.006
beta5 1.001 1.004
beta6 1.001 1.004
deviance 1.003 1.009
Multivariate PSRF
1.005

Geweke Diagnostics

\[~\] Geweke Diagnostics consists in an inter chain convergence check. Compares the first 20% of the chain after burnin with the half last 50% percent; If the samples are drawn from the stationary distribution of the chain, the two means are equal and Geweke’s statistic has an asymptotically standard normal distribution. The test statistic is a standard Z-score: the difference between the two sample means divided by its estimated standard error. The standard error is estimated from the spectral density at zero, and so takes into account any autocorrelation.

The Z-score is calculated under the assumption that the two parts of the chain are asymptotically independent.

\[ Z = \frac{\hat{\beta}_A-\hat{\beta}_B}{\frac{1}{T_A}\hat{S}^{A}_{\beta}(0)+\frac{1}{T_B}\hat{S}^{B}_{\beta}(0)} \] where \(A\),\(B\) are two windows within the Markov chain and the standard error is estimated from the spectral density at zero and so takes into account any autocorrelation. If \(|Z|<1.96\) we accept the null hypothesis.

Z-scores for Geweke diagnostic
Chain 1 Chain 2 Chain 3
beta0 0.509 0.378 -0.066
beta1 0.154 -0.097 -1.011
beta2 -1.221 -0.114 -0.821
beta4 0.163 -0.651 0.417
beta5 0.020 0.396 0.742
beta6 0.642 -0.546 0.846
deviance -0.019 -0.547 -0.022

Heidelberg & Welch diagnostics

\[~\] Uses a test statistic to verify the null hypothesis that the values sampled from the Markov Chian come from a stationary distribution. It composes of two parts:

  1. Stationary Test:

    1. Define an \(\alpha\) level.
    2. Test he null hypothesis that the chain is from a stationary distribution by calculating the Cramer-von-Mises statistic on the whole chain.
    3. If the null hypothesis is rejected, then discard the first 10% of the chain.
    4. Repeat the test, each time discarding the next 10% in case of rejection, until either the null hypothesis is accepted or 50% of the chain is discarded. If the test still rejects the null hypothesis, then the chain fails the test and needs to be run longer.
  2. Halfwidth Test: If the chain passes the first part of the diagnostic, then the part of the chain that was not discarded from the first part is used to test the second part.

Chain 1
Stationarity test start iteration p-value Halfwidth test Mean Halfwidth
beta0 pass 1 0.711 pass -8.864 0.043
beta1 pass 1 0.773 pass 0.117 0.002
beta2 pass 1 0.462 pass 0.035 0.000
beta4 pass 1 0.828 pass 0.100 0.001
beta5 pass 1 0.434 pass 0.914 0.018
beta6 pass 1 0.288 pass 0.011 0.001
deviance pass 1 0.160 pass 848.991 0.217
Chain 2
Stationarity test start iteration p-value Halfwidth test Mean Halfwidth
beta0 pass 1 0.986 pass -8.880 0.040
beta1 pass 1 0.880 pass 0.115 0.002
beta2 pass 1 0.809 pass 0.035 0.000
beta4 pass 1 0.806 pass 0.100 0.001
beta5 pass 1 0.386 pass 0.889 0.014
beta6 pass 1 0.786 pass 0.011 0.001
deviance pass 1 0.689 pass 848.647 0.217
Chain 3
Stationarity test start iteration p-value Halfwidth test Mean Halfwidth
beta0 pass 1 0.947 pass -8.848 0.044
beta1 pass 1 0.230 pass 0.116 0.002
beta2 pass 1 0.302 pass 0.035 0.000
beta4 pass 1 0.718 pass 0.099 0.001
beta5 pass 1 0.632 pass 0.915 0.018
beta6 pass 1 0.506 pass 0.011 0.001
deviance pass 1 0.965 pass 849.030 0.228

###Raftery & Lewis Diagnostic \[~\] This Introduces a MCMC diagnostic that estimates the number of iterations needed for a given level of precision in posterior samples.

coda::raftery.diag(coda.fit)
## [[1]]
## 
## Quantile (q) = 0.025
## Accuracy (r) = +/- 0.005
## Probability (s) = 0.95 
## 
## You need a sample size of at least 3746 with these values of q, r and s
## 
## [[2]]
## 
## Quantile (q) = 0.025
## Accuracy (r) = +/- 0.005
## Probability (s) = 0.95 
## 
## You need a sample size of at least 3746 with these values of q, r and s
## 
## [[3]]
## 
## Quantile (q) = 0.025
## Accuracy (r) = +/- 0.005
## Probability (s) = 0.95 
## 
## You need a sample size of at least 3746 with these values of q, r and s

As we can see above, we need N samples to achieve these performances in these different chains.

Validation

\[~\] In order to validate our model, we can create, through simulation, a synthetic dataset from a distribution whose parameters (\(\beta\)’s) are fixed and thus, known. Then, we check if it is able to recover the true population parameters by applying it to the simulated data.

simulation parameters fixed parameters ratio
beta0 -8.787 -8.798 1.001
beta1 0.083 0.114 1.375
beta2 0.035 0.035 0.990
beta4 0.109 0.099 0.909
beta5 0.873 0.896 1.026
beta6 0.006 0.011 1.783

As we can notice, the approximation ratio is \(\approx 1\) for almost all the features. We can conclude that the model is able to recover the true parameter and confirm its adequacy.

Evaluation

\[~\] We want to test our model by evaluating its predictive performances over new observations (test data);

Compare the metrics

\[~\] Here they are reported the results of our Bayesian Logistic Regression Classifier,with the canonical cut-off level = 0.5, compared to the ones obtained by using different Machine Learning models, namely:

  • Frequentistic Logistic Regression
  • Random Forest
  • Support Vector Classifier
Bayesian_LR Frequentistic_LR RandomForest SVM
Accuracy 0.7884615 0.7740385 0.8317308 0.7788462
Precision 0.8155340 0.9029126 0.8476190 0.7961165
Recall 0.7706422 0.7153846 0.8240741 0.7663551
F1-score 0.7924528 0.7982833 0.8356808 0.7809524

As we can notice, Random Forest is the best performing model; nevertheless, our Bayesian Logistic Regression has an accuracy higher than both Frequentistic Logistic Regression and Support Vector Classifier, despite a lower Precision than the former. An important remark is that we are mainly interested in achieving an high recall, since we want to detect as many Diabetic people as possible to allow them to preventive care; our model overtakes both Frequentistic LR and SVM.